Voxtral Realtime: enable CUDA backend with int4 quantization #17798
mergennachin merged 1 commit into main
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17798
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures, 1 Cancelled Job as of commit e5c3690 with merge base 0907294.
NEW FAILURES - The following jobs have failed:
CANCELLED JOB - The following job was cancelled. Please retry:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This PR needs a
Force-pushed 1e5399a to afe08f0
Force-pushed afe08f0 to 50e3a3d
Force-pushed 50e3a3d to e708015
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
examples/models/voxtral_realtime/voxtral_realtime_runner.cpp:612
`logits_to_token()` recreates/reseeds a `Sampler` on every decode step (seeded from `std::time(nullptr)`), so `temperature > 0` sampling won't have a stable RNG stream across tokens and can become repetitive. Since `StreamingSession` already has a `sampler_` member, it would be better to use that persistent sampler (with dtype switching for Float/BFloat16/Half) instead of calling `logits_to_token()` each step.
```cpp
    prev_token_, static_cast<uint64_t>(next_token));
if (piece.ok()) {
  token_cb_(*piece);
}
```
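To illustrate the RNG concern above: `std::time(nullptr)` has one-second resolution, so consecutive decode steps often reseed with the same value, and a freshly constructed sampler then repeats its first draw. The sketch below is a hypothetical toy, not the ExecuTorch `Sampler` API, contrasting per-step reseeding with a persistent sampler owned by the session.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Toy stand-in for a sampler; NOT the ExecuTorch Sampler API.
struct ToySampler {
  explicit ToySampler(uint64_t seed) : rng_(seed) {}
  uint32_t sample(uint32_t vocab_size) {
    std::uniform_int_distribution<uint32_t> dist(0, vocab_size - 1);
    return dist(rng_);
  }
  std::mt19937_64 rng_;
};

// Per-step reseeding: a fresh sampler with the same seed repeats its draw,
// which is what happens when std::time(nullptr) does not advance between steps.
inline uint32_t sample_reseeded(uint64_t seed, uint32_t vocab) {
  return ToySampler(seed).sample(vocab);
}
```

A persistent `ToySampler` member, by contrast, keeps a single evolving RNG stream: each call to `sample()` advances `rng_`, so draws differ across steps even when the wall clock does not move.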
```diff
 python -m executorch.examples.models.voxtral_realtime.export_voxtral_rt \
   --model-path "$LOCAL_MODEL_DIR" \
   --backend "$DEVICE" \
   ${STREAMING_ARG} \
   --output-dir "${OUTPUT_DIR}" \
-  ${VR_QUANT_ARGS}
+  ${VR_QUANT_ARGS} \
+  ${VR_DTYPE_ARGS}
```
In the voxtral_realtime export path, the script doesn't validate that the CUDA delegate data file (`aoti_cuda_blob.ptd`) was produced. Since the runner requires `--data_path` for CUDA, it'd be safer to add a `test -f "${OUTPUT_DIR}/aoti_cuda_blob.ptd"` check when `DEVICE=cuda` (similar to the Parakeet branch) so export failures are caught immediately.
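A minimal sketch of the suggested check, assuming the script's existing `DEVICE` and `OUTPUT_DIR` variables (the `touch` stands in for a real export producing the blob):

```shell
DEVICE="${DEVICE:-cuda}"
OUTPUT_DIR="${OUTPUT_DIR:-/tmp/vr_export_check}"
mkdir -p "${OUTPUT_DIR}"
touch "${OUTPUT_DIR}/aoti_cuda_blob.ptd"   # stand-in for a real export

# Fail fast if the CUDA export did not produce the delegate data file.
if [ "$DEVICE" = "cuda" ] && [ ! -f "${OUTPUT_DIR}/aoti_cuda_blob.ptd" ]; then
  echo "Error: CUDA export did not produce aoti_cuda_blob.ptd" >&2
  exit 1
fi
echo "delegate blob present"
```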
Force-pushed e708015 to 05f0ed2
```
|-----------|---|---|----------------------------------------------|
| `xnnpack` | ✓ | ✓ | `4w`, `8w`, `8da4w`, `8da8w`                 |
| `metal`   | ✓ | ✓ | none (fp32) or `fpa4w` (Metal-specific 4-bit) |
| `cuda`    | ✓ | ✓ | `4w`, `8w`, `8da4w`, `8da8w`                 |
```
Does Cuda support 8da4w/8da8w?
Related, I'm pretty sure xnnpack does not support 4w/8w.
> Does Cuda support 8da4w/8da8w?

Good catch, will fix.

> Related, I'm pretty sure xnnpack does not support 4w/8w.

xnnpack supports per-channel 4w and 8w. For example, we use 8w for token embeddings.
ET's embedding CPU op supports weight-only schemes, but I don't think xnnpack supports weight-only quantization for linear layers.
With that said, 4w/8da4w and 8w/8da8w quantize the weight data the same way. The only difference is that the 8da variants add fake activation quantization in front.
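To make the distinction concrete, here is a hedged sketch (toy code, not the ExecuTorch or torchao implementation) of per-channel int8 weight quantization, used the same way by `8w` and `8da8w`, where the `8da` variant only adds per-token fake quantization of activations before the matmul:

```python
import torch
import torch.nn.functional as F

def quantize_per_channel_int8(w: torch.Tensor):
    # Symmetric per-channel (per output row) int8 quantization of weights.
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def linear_8w(x, q, scale, bias=None):
    # Weight-only: dequantize weights, compute in float.
    return F.linear(x, q.float() * scale, bias)

def linear_8da8w(x, q, scale, bias=None):
    # Dynamic activation quantization: fake-quantize activations per token
    # first; the weight handling is identical to the weight-only path.
    a_scale = x.abs().amax(dim=-1, keepdim=True) / 127.0
    x_q = torch.clamp(torch.round(x / a_scale), -128, 127) * a_scale
    return F.linear(x_q, q.float() * scale, bias)
```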
@manuelcandales is there any plan for Metal AOTI to use int4/int8 for a more uniform experience?
The kernel should support it, because I'm using int4/int8 with MLX.
```
  --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
  --backend cuda \
  --dtype bf16 \
  --streaming \
```
if this is supported, then why not test it in CI?
```
fi
source .ci/scripts/export_model_artifact.sh cuda "${{ matrix.model.repo }}/${{ matrix.model.name }}" "${{ matrix.quant }}" "${RUNNER_ARTIFACT_DIR}"
# Voxtral Realtime uses offline mode for CUDA CI (not streaming)
```
Add CUDA/AOTI backend support for the Voxtral Realtime model alongside the existing XNNPACK and Metal backends.

Model (model.py):
- CudaSDPA: F.scaled_dot_product_attention with repeat_interleave for GQA expansion and boolean attention masks (Triton SDPA requirement)
- StaticKVCache (shared with Metal) for [B,H,S,D] layout with index_copy_
- StandardEncoderRingKVCache/StandardEncoderSDPA for streaming encoder
- _build_causal_mask_bool: 4D boolean mask for Triton compatibility
- Simplified LMAttention.forward to always pass attn_mask (None for XNNPACK)

Export (export_voxtral_rt.py):
- --backend cuda with CudaPartitioner and conv1d_to_conv2d decomposition
- --dtype flag (default fp32, bf16 for CUDA Triton SDPA)
- --qlinear-packing-format / --qlinear-encoder-packing-format for tile_packed_to_4d int4 quantization
- CUDA device placement, Dim.AUTO for audio encoder, .ptd output

Runner (main.cpp, voxtral_realtime_runner.cpp/.h):
- --data_path flag for .ptd delegate data (CUDA compiled kernels)
- Module two-arg constructor for pte+ptd loading

Build (CMakePresets.json, Makefile):
- voxtral-realtime-cuda preset
- make voxtral_realtime-cuda target

CI (.github/workflows/cuda.yml, .ci/scripts/):
- Voxtral Realtime in CUDA CI matrix (int4-tile-packed, offline mode)
- Export/test scripts updated for CUDA quantization args and data path
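The CudaSDPA approach described in the PR summary can be sketched as follows. This is a simplified illustration under assumed shapes, not the PR's exact code: KV heads are expanded with `repeat_interleave` to match query heads (GQA), and the attention mask is a 4D boolean tensor (True = attend), which the summary notes is a Triton SDPA requirement.

```python
import torch
import torch.nn.functional as F

def cuda_sdpa(q, k, v, mask, n_rep):
    # GQA expansion: replicate each KV head n_rep times along the head dim.
    k = k.repeat_interleave(n_rep, dim=1)
    v = v.repeat_interleave(n_rep, dim=1)
    # Boolean attn_mask: True means the position may be attended to.
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)

# Illustrative shapes: 8 query heads sharing 2 KV heads.
q = torch.randn(1, 8, 16, 64)   # [B, H_q, S, D]
k = torch.randn(1, 2, 16, 64)   # [B, H_kv, S, D]
v = torch.randn(1, 2, 16, 64)
# 4D boolean causal mask, in the spirit of _build_causal_mask_bool.
mask = torch.tril(torch.ones(16, 16, dtype=torch.bool)).view(1, 1, 16, 16)
out = cuda_sdpa(q, k, v, mask, n_rep=4)
```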
Force-pushed 05f0ed2 to e5c3690
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
@@ -474,15 +552,22 @@ def main():

```python
os.makedirs(args.output_dir, exist_ok=True)

# Load model
model_dtype = {"fp32": torch.float32, "bf16": torch.bfloat16}[args.dtype]

print("Loading model...")
model = load_model(
    args.model_path,
    max_seq_len=args.max_seq_len,
    n_delay_tokens=args.delay_tokens,
    dtype=model_dtype,
    backend=args.backend,
)

# Move to CUDA for CUDA backend export (AOTInductor needs CUDA tensors)
if args.backend == "cuda":
    print("Moving model to CUDA...")
    model.cuda()
```
For `--backend cuda`, leaving `--dtype` at the current default (fp32) is likely to produce an exported model that fails at runtime/compile time once SDPA is replaced by the CUDA Triton `triton::sdpa` op, which currently enforces bfloat16 inputs. Consider either (a) making bf16 the default when `--backend cuda`, (b) erroring out if `--backend cuda` and `--dtype fp32`, or (c) automatically setting a CUDA compile spec (e.g., `triton_kernel_mode=OFF`) when exporting fp32 so SDPA falls back to a non-Triton implementation.
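Options (a) and (b) above could be combined in argument handling along these lines. A hedged sketch only: the flag names match the PR's `--backend`/`--dtype`, but the default-resolution logic is an assumption, not the PR's code.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--backend", choices=["xnnpack", "metal", "cuda"], default="xnnpack")
# Default of None lets us pick a backend-appropriate dtype afterwards.
parser.add_argument("--dtype", choices=["fp32", "bf16"], default=None)
args = parser.parse_args(["--backend", "cuda"])

if args.dtype is None:
    # Option (a): bf16 by default for CUDA, fp32 elsewhere.
    args.dtype = "bf16" if args.backend == "cuda" else "fp32"
elif args.backend == "cuda" and args.dtype == "fp32":
    # Option (b): reject a combination known to fail under Triton SDPA.
    parser.error("--backend cuda requires --dtype bf16 (Triton SDPA enforces bfloat16)")
```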
```shell
# Add CUDA data path if present
if [ "$DEVICE" = "cuda" ] && [ -f "${MODEL_DIR}/aoti_cuda_blob.ptd" ]; then
  RUNNER_ARGS="$RUNNER_ARGS --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd"
fi
```
This block appends --data_path ... for CUDA, but the script already adds --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd for all non-llama runners earlier (before the model-specific case). For Voxtral Realtime on CUDA this results in duplicate --data_path arguments. Please remove this per-model addition (or refactor the earlier common CUDA handling to avoid double-appending for voxtral_realtime).
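One way to make the per-model branch safe regardless of what the common handling did is an idempotent append. A sketch only, with placeholder values for `RUNNER_ARGS` and `MODEL_DIR`; the refactor the reviewer suggests (removing the per-model block entirely) is the simpler fix:

```shell
RUNNER_ARGS="--model voxtral_realtime"
MODEL_DIR="/tmp/model"

add_data_path() {
  case " $RUNNER_ARGS " in
    *" --data_path "*) ;;  # already set earlier in the script, do nothing
    *) RUNNER_ARGS="$RUNNER_ARGS --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd" ;;
  esac
}

add_data_path
add_data_path   # second call is a no-op, so no duplicate flag
echo "$RUNNER_ARGS"
```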
Suggested change:
```diff
-# Add CUDA data path if present
-if [ "$DEVICE" = "cuda" ] && [ -f "${MODEL_DIR}/aoti_cuda_blob.ptd" ]; then
-  RUNNER_ARGS="$RUNNER_ARGS --data_path ${MODEL_DIR}/aoti_cuda_blob.ptd"
-fi
```